Hi guys, this is a machine learnig tutorial for R users. We have a wonderful tutorial for python users and not for R, so i thought i should make one. Actually, i don’t know, how many of the top 1% use R, but i still can say that if there are any R users in top 1, they will be few in numbers. This is because of the complexity involved in this dataset and less resources for clustering in R, so it becomes quite challenging for R users. I hope, this kernel lets R users to start somewhere. This kernel running time is somewhere around 8 minutes, means we can get a predictin in 8 minutes. This is tested on 4gb RAM and i5-4 laptop.
Note - I may not evaluate some parts of this kernel to make sure that this kernel stays under kaggle kernel running limits. If you want, you can fork the notebook, download the code and try it in your local R environment.
In this competition, Our target variable is trip_duration, and we have to predict it. Train dataset has some 11 columns and test dataset has some 9 columns, test dataset doesn’t have trip_duration variable and dropoff_datatime variable.
Train dand test dataset is from the same period.
First, we load the necessary libraries to do our analysis
library(data.table)
library(dplyr)
library(ggplot2)
library(highcharter)
library(plotly)
library(leaflet)
library(Metrics)
library(edgebundleR)
library(corrplot)
library(tidyr)
library(reshape2)
library(DT)
library(FNN)
library(highcharter)
library(lubridate)
library(geosphere)
library(broom)
library(xgboost)
We load in all five datasets in this kernel. Two from the competition and other three from the added data resources by other kagglers.
train <- fread("train.csv")
test <- fread("test.csv")
osrm_t1 <- fread("fastest_routes_train_part_1.csv")
osrm_t2 <- fread("fastest_routes_train_part_2.csv")
osrm_test <- fread("fastest_routes_test.csv")
train[, filter:= 0]
test[, filter:= 1]
dataset <- rbindlist(list(train, test), fill = T)
Every now and then, we will use the combined(train and test both) dataset to calculate some statistics or extract a new feature
summary lets us get the centre of distribution of any feature inside the dataset, it also tells us the number of NA’s, if present, in a dataset.
summary(dataset)
## id vendor_id pickup_datetime dropoff_datetime
## Length:2083778 Min. :1.000 Length:2083778 Length:2083778
## Class :character 1st Qu.:1.000 Class :character Class :character
## Mode :character Median :2.000 Mode :character Mode :character
## Mean :1.535
## 3rd Qu.:2.000
## Max. :2.000
##
## passenger_count pickup_longitude pickup_latitude dropoff_longitude
## Min. :0.000 Min. :-121.93 Min. :34.36 Min. :-121.93
## 1st Qu.:1.000 1st Qu.: -73.99 1st Qu.:40.74 1st Qu.: -73.99
## Median :1.000 Median : -73.98 Median :40.75 Median : -73.98
## Mean :1.664 Mean : -73.97 Mean :40.75 Mean : -73.97
## 3rd Qu.:2.000 3rd Qu.: -73.97 3rd Qu.:40.77 3rd Qu.: -73.96
## Max. :9.000 Max. : -61.34 Max. :51.88 Max. : -61.34
##
## dropoff_latitude store_and_fwd_flag trip_duration filter
## Min. :32.18 Length:2083778 Min. : 1 Min. :0.0
## 1st Qu.:40.74 Class :character 1st Qu.: 397 1st Qu.:0.0
## Median :40.75 Mode :character Median : 662 Median :0.0
## Mean :40.75 Mean : 959 Mean :0.3
## 3rd Qu.:40.77 3rd Qu.: 1075 3rd Qu.:1.0
## Max. :48.86 Max. :3526282 Max. :1.0
## NA's :625134
There are no NAs in either train or test dataset, so we are good to go
Our prediction is going to be evaluated with RMSLE metric, this is derived from the RMSE, and this just the RMSE with log transformed target and prediction. This means instead of using raw trip duration, we take log-trip_duration as target variable, and when submitting our prediction, we exponentiate the prediction to bring it into original scale.
Whenever you start with any competition, make sure you perform these 2Ps right at the start. First P means Preprocess. Preprocessing involves scaling, standardizing, transformations and stuffs like that. Here, we do log transformation of the target variable.
train[, trip_duration:= log(trip_duration)]
## id vendor_id pickup_datetime dropoff_datetime
## 1: id2875421 2 2016-03-14 17:24:55 2016-03-14 17:32:30
## 2: id2377394 1 2016-06-12 00:43:35 2016-06-12 00:54:38
## 3: id3858529 2 2016-01-19 11:35:24 2016-01-19 12:10:48
## 4: id3504673 2 2016-04-06 19:32:31 2016-04-06 19:39:40
## 5: id2181028 2 2016-03-26 13:30:55 2016-03-26 13:38:10
## ---
## 1458640: id2376096 2 2016-04-08 13:31:04 2016-04-08 13:44:02
## 1458641: id1049543 1 2016-01-10 07:35:15 2016-01-10 07:46:10
## 1458642: id2304944 2 2016-04-22 06:57:41 2016-04-22 07:10:25
## 1458643: id2714485 1 2016-01-05 15:56:26 2016-01-05 16:02:39
## 1458644: id1209952 1 2016-04-05 14:44:25 2016-04-05 14:47:43
## passenger_count pickup_longitude pickup_latitude
## 1: 1 -73.98215 40.76794
## 2: 1 -73.98042 40.73856
## 3: 1 -73.97903 40.76394
## 4: 1 -74.01004 40.71997
## 5: 1 -73.97305 40.79321
## ---
## 1458640: 4 -73.98220 40.74552
## 1458641: 1 -74.00095 40.74738
## 1458642: 1 -73.95913 40.76880
## 1458643: 1 -73.98208 40.74906
## 1458644: 1 -73.97954 40.78175
## dropoff_longitude dropoff_latitude store_and_fwd_flag
## 1: -73.96463 40.76560 N
## 2: -73.99948 40.73115 N
## 3: -74.00533 40.71009 N
## 4: -74.01227 40.70672 N
## 5: -73.97292 40.78252 N
## ---
## 1458640: -73.99491 40.74017 N
## 1458641: -73.97018 40.79655 N
## 1458642: -74.00443 40.70737 N
## 1458643: -73.97463 40.75711 N
## 1458644: -73.97281 40.79058 N
## trip_duration filter
## 1: 6.120297 0
## 2: 6.496775 0
## 3: 7.661056 0
## 4: 6.061457 0
## 5: 6.075346 0
## ---
## 1458640: 6.656727 0
## 1458641: 6.484635 0
## 1458642: 6.638568 0
## 1458643: 5.921578 0
## 1458644: 5.288267 0
dataset <- rbindlist(list(train, test), fill = T)
The second p means Preparation. This step involves imputing NA’s, if any, turning numeric variables to factor variables and factor variables to numeric variables, as required by the ML algo. Here, we turn just the “store_and_fwd_flag” variable to numeric, from character.
dataset[, store_and_fwd_flag:= as.numeric(as.factor(store_and_fwd_flag))]
Clustering and predicting new clusters from the trained cluster model is very complicated in R. The default kmeans function trains on a dataset and predicts the clusters for that dataset only and doesn’t have any prediction mechanism for new datasets. This problem of predicting form the existing kmeans cluster model is solved by the “kcca” package which has the prediction function for a kmeans object. But the kmeans algorithm runs for a long time, like forever, and while the algo is running it feels that it might take a year for training kmeans and then an another year for predicting on new dataset.
We can get around these complexities by using defaut kmeans function to train the clustering algo on train dataset, and then predicting the clusters on new dataset by using “get.knnx” function from FNN package. I achieve this down below.
This chunk should run for maximum 3 minutes.
#pickup co-ordinates
i <- cbind(pick_longitude = train$pickup_longitude, pick_latitude = train$pickup_latitude)
set.seed(78)
idx <- sample(nrow(i), 0.50 * nrow(i))
#cluster <- kcca(k[idx, ], 70, family = kccaFamily("kmeans")) #Runs forever
#computing distance
set.seed(76)
pick_cluster <- kmeans(i[idx,], 70, nstart = 15)
#k_test <- cbind(pick_longitude = test[["pickup_longitude"]], pick_latitude = #test[["pickup_latitude"]], drop_longitude = test[["dropoff_longitude"]], #drop_latitude = test[["dropoff_latitude"]])
#train pickup cluster
train_pickup <- get.knnx(pick_cluster$centers, train[, list(pickup_longitude, pickup_latitude)], 1)$nn.index[,1]
train[,pickup_cluster:= train_pickup]
#test pickup cluster
test_pickup <- get.knnx(pick_cluster$centers, test[, list(pickup_longitude, pickup_latitude)], 1)$nn.index[,1]
test[,pickup_cluster:= test_pickup]
This takes less time than it takes to get pickup cluster.
#second set of coordinates
j <- cbind(drop_longitude = train$dropoff_longitude, drop_latitude= train$dropoff_latitude)
#clustering
set.seed(76)
drop_cluster <- kmeans(j[idx,], 70, nstart = 15)
#train dropoff clusters
train_dropoff <- get.knnx(drop_cluster$centers, train[, list(dropoff_longitude, dropoff_latitude)], 1)$nn.index[,1]
train[, dropoff_cluster:= train_dropoff]
#test dropoff clusters
test_dropoff <- get.knnx(drop_cluster$centers, test[, list(dropoff_longitude, dropoff_latitude)], 1)$nn.index[,1]
test[, dropoff_cluster:= test_dropoff]
We extract two distance features from the pickup and dropoff co-ordinates. First distance feature is a haversine distance calculated by pickup and dropoff co-ordinates
#train co-ordinates
i <- cbind(pick_longitude = train$pickup_longitude, pick_latitude = train$pickup_latitude)
j <- cbind(drop_longitude = train$dropoff_longitude, drop_latitude= train$dropoff_latitude)
#computing haversine distance from co-ordinates
train[, distance := distHaversine(i, j)]
#log of distance
train[, distance := log(distance+1)]
#test coordinates
i_test <- cbind(pick_longitude = test[["pickup_longitude"]], pick_latitude = test[["pickup_latitude"]])
j_test <- cbind(drop_longitude = test[["dropoff_longitude"]], drop_latitude = test[["dropoff_latitude"]])
#computing distance from co-ordinates
test[, distance := distHaversine(i_test, j_test)]
#log of distance
test[, distance:= log(distance+1)]
Converting the distance to log helps the prediction accuracy a lot. Let me show it with a plot.
train %>% sample_frac(0.003)%>% plot_ly(x =~trip_duration, y = ~distance, alpha = 0.3) %>% add_markers(marker = list(line = list(color = "black", width = 1))) %>%
layout(
title = "<b>Relation of journey time and distance</b>",
xaxis = list(domain = c(0.1, 1), title = "<b><i>trip duration(log)</i></b>"),
yaxis = list(title = "<b><i>distance(log)</i></b>"),
updatemenus = list(
list(
y = 0.8,
buttons = list(
list(method = "restyle",
args = list("type", "scatter"),
label = "Scatter"),
list(method = "restyle",
args = list("type", "histogram2d"),
label = "2D Histogram")))
))
We can see that we have quite a good linear relationship between log of distance and log of trip duration. Now see the same plot without log transformation of distance
train %>% sample_frac(0.003)%>% plot_ly(x =~trip_duration, y = ~exp(distance)-1, alpha = 0.3) %>% add_markers(marker = list(line = list(color = "black", width = 1))) %>%
layout(
title = "<b>Relation of journey time and distance</b>",
xaxis = list(domain = c(0.1, 1), title = "<b><i>trip duration(log)</i></b>"),
yaxis = list(title = "<b><i>distance</i></b>"),
updatemenus = list(
list(
y = 0.8,
buttons = list(
list(method = "restyle",
args = list("type", "scatter"),
label = "Scatter"),
list(method = "restyle",
args = list("type", "histogram2d"),
label = "2D Histogram")))
))
Here, we calculate the distance of dropout points from that of central park, NY. We use the distCosine function, we use the coordinates(got from web) of central park as the first argument and then the dropoff coordinates to get the distance.
central_park <- c(73.9654,40.7829)
train[,distance_to_central:= distCosine(j, central_park)]
test[, distance_to_central:= distCosine(j_test, central_park)]
h <- train %>% sample_frac(0.003, replace = F)
highchart() %>%
hc_title(text = "Simple scatter chart") %>%
hc_xAxis(category = h$trip_duration) %>%
hc_add_series(data = h$distance_to_central, type ="scatter")
Several pickup latitudes and longitudes deviate just a little bit from each other, implying that they are actually the same location but with few deviations. that is why, we did the clustering, earlier, so to get the actual counts of pickup points, we round the latitudes and longitudes to 3 digits.
#Rounding train pickup and dropoff co-ordinates
train[,":=" (pick_lat = round(pickup_latitude, 3),
pick_long = round(pickup_longitude,3),
drop_lat = round(dropoff_latitude, 3),
drop_long = round(dropoff_longitude,3))]
#Rounding test pickup and dropoff co-ordinates
test[,":="(pick_lat = round(pickup_latitude, 3),
pick_long = round(pickup_longitude,3),
drop_lat = round(dropoff_latitude, 3),
drop_long = round(dropoff_longitude,3))]
#pickup and dropoff points
train[,":="(pickup_point= paste(pick_long, pick_lat),
dropoff_point = paste(drop_long, drop_lat))]
test[, ":="(dropoff_point = paste(drop_long, drop_lat),
pickup_point = paste(pick_long, pick_lat))]
#pickup and dropoff points counts
train[, pick_count:= .N, by = pickup_point]
train[, drop_count:= .N, by = dropoff_point]
test[, drop_count:= .N, by = dropoff_point]
test[ , pick_count := .N, by = pickup_point]
#pickup and dropoff cluster counts
train[, ":="(pickup_cluster_count= .N), by = pickup_cluster]
train[, dropoff_cluster_count:= .N, by = dropoff_cluster]
test[, pickup_cluster_count:= .N, by = pickup_cluster]
test[, dropoff_cluster_count:= .N, by = dropoff_cluster]
Here, we average the trip duration variable by month, day and hour and make it into a new feature, similarly we do that with pickup_point variable created before this feature.
We can do this because train and test data is from the same period, if they were from different period, we wouldn’t be able to apply the same method there.
#train[, month_unique := length(unique(pickup_point)), by = pick_month]
#train[, wday_unique := length(unique(pickup_point)), by = pick_wday]
#train[, day_unique := length(unique(pickup_point)), by = pick_day]
#train[, hour_unique := length(unique(pickup_point)), by = pick_hour]
h <- train[, .(time_mean_trip = mean(trip_duration)), by = list(pick_month, pick_day, pick_hour)]
train <- merge(train, h, by = c("pick_month", "pick_day", "pick_hour"), all.x = T)
test <- merge(test, h, by = c("pick_month", "pick_day", "pick_hour"), all.x = T)
l <- train[, .(pick_mean_trip = mean(trip_duration)), by = list(pickup_point)]
train <- merge(train, l, by = "pickup_point", all.x = T)
test <- merge(test, l, by = "pickup_point", all.x = T)
The speed of a vehicle is linearly dependent on time taken by the trip. We find the speed by dividing the distance calculated by tri_duration and then multiply it by 1000 to get the speed in km/hr
train[, speed := 1000 * distance/trip_duration]
avg_speed <- train[,.(mean_speed = mean(speed, na.rm = T)), by = list(pick_long, pick_lat)]
train <- merge(train, avg_speed , by =c("pick_long", "pick_lat"))
test <- merge(test, avg_speed, by = c("pick_long", "pick_lat"))
merge the osrm features to train and test dataset by “id”. WE have train osrm dataset in two part, so we first bind them and then merge it with train dataset. We use only three numeric columns from that dataset, “total_distance”, “total_travel_time” and “number_of_steps”.
test <- merge(test, osrm_test[, list(id, total_distance, total_travel_time, number_of_steps)], by = "id", all.x = T, sort = F)
osrm_train <- bind_rows(osrm_t1, osrm_t2)
train <- merge(train, osrm_train[, list(id, total_distance, total_travel_time, number_of_steps)], by = "id", all.x = T, sort = F)
We have some NA’s in test dataset and very few in the newly appended features from osrm datasets in train. We impute them with NaN because the xgboost model, which we are going to use, handles NaN values separately.
colSums(is.na(test))
test[which(is.na(time_mean_trip)), time_mean_trip:= NaN]
test[which(is.na(pick_mean_trip)), pick_mean_trip:= NaN]
train[is.na(total_distance), total_distance:= NaN]
train[is.na(total_travel_time), total_travel_time:= NaN]
train[, number_of_steps:= as.numeric(number_of_steps)]
train[is.na(number_of_steps), number_of_steps:= NaN]
validation forms the important part of any competition, if your local validation is robust, then you can really catapult your prediction scores on Public LB without few submissions. The local validation becomes quite important when we have quite a big dataset and, it is not feasible to try out every new idea on the whole dataset and submit the prediction only to know that the idea doen’t worth your time.
Cross-Validation is always the first preferrence of every data scientist, but owing to the large size of this dataset, we drop the idea of cross validation. If you have good PC/laptop with good resources, you can try this but if you have machine configurations same as me, 4gb RAM and i5-4, then it is advisable to go with plain validation. Trust me, you don’t want to waste your time just looking at the cross-validation to start, let alone it being finished.
Just Validation means setting aside just a portion of your training data as validation dataset and train your data on the remaining train dataset and check your accuracy on the validation dataset before predicting on test dataset. This framework is not as solid as cross-validation but is, still, quite useful in these scenarios. We will use the plain validation framework in this kernel.
#predictors to use in training
predictors <- setdiff(names(train), c("id", "store_and_fwd_flag", "pickup_point", "filter", "pick_long", "pick_lat", "drop_lat", "drop_long", "trip_duration", "pickup_datetime", "dropoff_datetime", "speed", "dropoff_point"))
#setting seed is necessary for reproducibilty
set.seed(78)
index <- sample(nrow(train), 0.80 * nrow(train)) #a portion of train dataset
#sparse_train <- Matrix(as.matrix(train[index, predictors, with = F]), sparse = T)
#sparse_valid <- Matrix(as.matrix(train[-index, predictors, with = F]), sparse = T)
#sparse_test <- Matrix(as.matrix(test[, predictors, with = F]), sparse = T)
dtrain <- xgb.DMatrix(as.matrix(train[index, predictors, with= F]), label = train$trip_duration[index])
dvalid <- xgb.DMatrix(as.matrix(train[-index, predictors, with = F]), label = train$trip_duration[-index])
dtest <- xgb.DMatrix(as.matrix(test[, predictors, with = F]))
If you use NaN as the value for mean imputation, then you can’t turn your train and test datasets to sparse matrices, but if you’ve used some other number for NA imputation then you can use the sparse train and test matrices, it will speed up your training procedure, even mpre.
Xgboost is the most efficient and most famous ML Algorithm on kaggle. This scales better for big datasets without any compromise on the effectives or prediction power. This is a must know alorithm for any newbie data scientist. We train the model for 150 rounds and set some hyperparameters to our liking.
params <- list(
booster = "gbtree",
objective = "reg:linear",
eta=0.05,
gamma=0,
max_depth= 3,
subsample=0.8,
colsample_bytree=0.8,
eval_metric = "rmse"
)
#use this cv framework if you have satisfactory machine configurations
#set.seed(679)
#cvModel <- xgb.cv(params, dtrain, nrounds = 1000, early_stopping_rounds = 10, #maximize = T, nfold = 5, print_every_n = 10)
#validation feramework
set.seed(123)
xgbModel <- xgb.train(params = params, nthreads = 4, data =dtrain, early_stopping_rounds = 10, watchlist = list(train = dtrain, test = dvalid), verbose = 1, maximize =F, nrounds =150, print_every_n = 5)
If you want to see the feature importance, then use this code.
importance_matrix <- xgb.importance(feature_names = predictors, xgbModel)
xgb.plot.importance(importance_matrix = importance_matrix, top_n = 50)
We know build ourxgboost model with full data and with selected features.
#whole train dataset
dtrain <- xgb.DMatrix(as.matrix(train[, predictors, with = F]), label = train$trip_duration)
#setting seed and training
set.seed(123)
xgb <- xgb.train(params = params, nthreads = 4, data =dtrain, early_stopping_rounds = 10, watchlist = list(train = dtrain), verbose = 1, maximize =F, nrounds =150, print_every_n = 5)
After training, we predict on the test dataset.
prediction <- predict(xgb, dtest)
submit <- data.frame(id = test$id, trip_duration = exp(prediction))
write.csv(submit, "submit.csv", row.names = F)
As we did our prediction on log transformed trip duration, we exponentiate our prediction and reverse it back to the original scale.
This is the most crucial and important step for a newbie to learn. I am exspecially, more, excited to teach you how to build hybrid models. Hybrid Models come very handy, expecially in these circumstances when the dataset is in 100s of Mbs.
This is coming soon, please come back again.
Note - Don’t worry about low score, we are surely going to go below 0.4 without ensembling, then we do ensembling.
Please upvote, if you find it useful.